HadoopProv: Towards Provenance as a First Class Citizen in MapReduce

Authors

  • Sherif Akoush
  • Ripduman Sohan
  • Andy Hopper
Abstract

We introduce HadoopProv, a modified version of Hadoop that implements provenance capture and analysis in MapReduce jobs. It is designed to minimise provenance capture overheads by (i) treating provenance tracking in the Map and Reduce phases separately, and (ii) deferring construction of the provenance graph to the query stage. Provenance graphs are later built by joining the Map and Reduce provenance files on matching intermediate keys. In our prototype implementation, HadoopProv adds an overhead below 10% to typical job runtime (average runtime increases of under 7% on Map tasks and under 30% on Reduce tasks). Additionally, we demonstrate that provenance queries are serviceable in O(k log n), where n is the number of records per Map task and k is the number of Map tasks in which the key appears.
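
The query-time join described above can be illustrated with a minimal Python sketch. It is not HadoopProv's implementation: the in-memory record layouts, the names `map_prov`, `reduce_prov`, `lookup`, and `trace`, and the assumption that each Map provenance file is sorted by intermediate key are all hypothetical, chosen only to show how a lookup could cost O(log n) per Map task and O(k log n) overall.

```python
from bisect import bisect_left

# Hypothetical Map-side provenance: one list per Map task, sorted by
# intermediate key. Each entry is (intermediate_key, input_record_offset).
map_prov = {
    "map_0": [("apple", 0), ("banana", 17), ("cherry", 42)],
    "map_1": [("banana", 5), ("date", 23)],
}

# Hypothetical Reduce-side provenance: for each intermediate key, the
# Map tasks that emitted it (the k tasks in the O(k log n) bound).
reduce_prov = {
    "banana": ["map_0", "map_1"],
    "cherry": ["map_0"],
}


def lookup(entries, key):
    """Binary-search one sorted Map provenance list for `key`: O(log n)."""
    # (key,) sorts before every (key, offset) tuple, so bisect_left
    # lands on the first entry carrying this key, if any.
    i = bisect_left(entries, (key,))
    offsets = []
    while i < len(entries) and entries[i][0] == key:
        offsets.append(entries[i][1])
        i += 1
    return offsets


def trace(key):
    """Join Reduce and Map provenance on the intermediate key at query time."""
    result = []
    for task in reduce_prov.get(key, []):  # k Map tasks
        for offset in lookup(map_prov[task], key):  # O(log n) each
            result.append((task, offset))
    return result


print(trace("banana"))
```

Because nothing is joined until `trace` is called, capture only has to append records to per-phase files; the cost of building the graph is paid by the queries that actually need it.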

Similar resources

PASSing the provenance challenge

Provenance-aware storage systems (PASS) are a new class of storage system treating provenance as a first-class object, providing automatic collection, storage, and management of provenance as well as query capabilities. We developed the first PASS prototype between 2005 and 2006, targeting scientific end users. Prior to undertaking the provenance challenge, we had focused on provenance collecti...

RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows

RAMP (Reduce And Map Provenance) is an extension to Hadoop that supports provenance capture and tracing for workflows of MapReduce jobs. RAMP uses a wrapper-based approach, requiring little if any user intervention in most cases, while retaining Hadoop’s parallel execution and fault tolerance. We demonstrate RAMP on a real-world MapReduce workflow generated from a Pig script that performs senti...

Scalable Progressive Analytics on Big Data in the Cloud

Analytics over the increasing quantity of data stored in the Cloud has become very expensive, particularly due to the pay-as-you-go Cloud computation model. Data scientists typically manually extract samples of increasing data size (progressive samples) using domain-specific sampling strategies for exploratory querying. This provides them with user-control, repeatable semantics, and result prov...

A Study of Young People's Awareness of Citizenship Rights and Claims (Case Study: Youth of Saqqez County)

The aim of this thesis is to assess the level of awareness and knowledge of citizenship rights and claims among the youth of Saqqez city. Political, social, cultural, and economic relations have changed rapidly in recent decades, especially in the 1990s, and Iran has undergone major transformations. These transformations include a wide spectrum of fundamental changes and behavioral patterns...

Temporal Provenance Model (TPM): Model and Query Language

Provenance refers to the documentation of an object’s lifecycle. This documentation (often represented as a graph) should include all the information necessary to reproduce a certain piece of data or the process that led to it. In a dynamic world, as data changes, it is important to be able to get a piece of data as it was, and its provenance graph, at a certain point in time. Supporting time-a...

Publication date: 2013